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The ability to learn and perform statistical inference with biologically plausible recurrent 
networks of spiking neurons is an important step toward understanding perception 
and reasoning. Here we derive and investigate a new learning rule for recurrent 
spiking networks with hidden neurons, combining principles from variational learning and 
reinforcement learning. Our network defines a generative model over spike train histories 
and the derived learning rule has the form of a local Spike Timing Dependent Plasticity 
rule modulated by global factors (neuromodulators) conveying information about "novelty" 
on a statistically rigorous ground. Simulations show that our model is able to learn both 
stationary and non-stationary patterns of spike trains. We also propose one experiment 
that could potentially be performed with animals in order to test the dynamics of the 
predicted novelty signal. 

Keywords: neural networks, variational learning, spiking neurons, synapses, action potentials 



1. INTRODUCTION 

Humans and animals are able to learn complex behavioral tasks 
and memorize events or temporally structured episodes. Most 
likely, learning and memory formation are intimately linked 
to changes in the synaptic connection strength between neu- 
rons. Long-term potentiation and depression of synapses can be 
induced by many different experimental protocols, and depend 
on voltage (Artola and Singer, 1993; Ngezahayo et al, 2000), 
spike-timing (Markram et al, 1997; Bi and Poo, 2001) as well as a 
subtle combination of timing, voltage, and frequency (Sjostrom 
et al., 2001; Clopath et al., 2010). Spike-Timing Dependent 
Plasticity (STDP) has intrigued theoreticians, because it provides 
a local Hebbian learning rule for spiking neurons; local, here, 
means that the dynamics of the synapses is of the form oc 
h{posti, prcj), where prCj is the set of pre-synaptic variables of 
neuron j (e.g., spike timing) and posf,- is the set of post-synaptic 
variables of neuron i (e.g., spike times and voltage) and h is an 
arbitrary functional. 

Unsupervised learning through STDP has been repeatedly 
shown (Levy et al., 2001; Song and Abbott, 2001; Izhikevich et al., 
2004; Morrison et al, 2007; Gateau et al, 2008; Gilson et al, 2009; 
Glopath et al, 2010) to yield connectivity structures that leads to 
non-trivial activity patterns in recurrent spiking networks. 

With relation to neuroscience, unsupervised learning is most 
commonly related to developmental plasticity (Miller et al., 
1989), formation of receptive fields (Song and Abbott, 2001) or 
cortical rewiring (Young et al., 2007). Indeed most early appli- 
cations of unsupervised STDP concern the learning of feedfor- 
ward connections and the formation of receptive fields (Gerstner 
et al, 1996; Kempter et al, 1999; Song et al., 2000; Song and 
Abbott, 2001). Unsupervised STDP will tune to the earliest spikes 
(Song and Abbott, 2001; Gerstner and Kistler, 2002; Guyonneau 



et al., 2005) and can perform Independent Gomponent Analysis 
(Glopath et al, 2010; Savin et al, 2010). 

On the level of behavioral neuroscience, human performance 
approaches in many psychophysical learning paradigms Bayes 
optimality, i.e., the best statistical model cannot perform better 
than humans do (Knill and Pouget, 2004; Kording and Wolpert, 
2004). This supports the hypothesis that the brain is performing 
approximate inference, which implies that the brain has access to 
prior and posterior distribution of possible explanations of the 
observed data (Berkes et al., 201 1). 

These findings lead to the idea that the spiking activity of 
the brain constitutes a generative model, that is, a model of 
the joint distribution of percepts (observed spike trains induced 
by sensors) and hidden causes in the world (hidden spike train 
generated by neurons that are not directly affected by sensor 
spikes). 

The ability to model hidden causes in the sensory data is 
important for both stationary and non-stationary situations. 
For stationary distributions of spike trains, hidden neurons are 
important to encode potential interpretations or explanations of 
spike patterns observed at the sensory neurons (visible neurons). 
For non-stationary sequences, hidden neurons are fundamen- 
tal for representing the non-observed dynamics and to form 
long-term memories. 

Various abstract Bayesian models have been proposed to 
account for this phenomenon (Kording and Wolpert, 2004; 
Deneve, 2008; Nessler et al., 2009). However, it remains an 
open question whether optimization in abstract Bayesian models 
can be translated into plausible learning rules for spiking neu- 
rons. If one considers only stationary input patterns, an explicit 
relation between Bayesian inference and synaptic plasticity has 
been suggested (Habenschuss et al, 2012). Moreover, it has been 
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suggested recently (Pecevski et al., 2011; Shao, 2012) that spiking 
networks with biologically plausible dynamics can produce sta- 
tionary samples from Deep Boltzmann Machines (Salakliutdinov 
and Hinton, 2009) and more general Bayesian networks. In the 
following we drop the limitations of stationary spatial patterns 
and consider a recurrent network of stochastically spiking neu- 
rons as a generative model of spatio-temporal spiking patterns. 

Learning the synapses between hidden neurons in recurrent 
models has been recognized as a difficult problem, as these 
synapses only contribute indirectly to the activity of the visible 
neurons and as recurrence leads to the "vanishing gradient" prob- 
lem (Bengio et al., 1994). Conversely, in machine learning the 
algorithms known to be efficient for learning graphical and recur- 
rent models are typically non-local (Jaakkola and Jordan, 2000; 
Bhatnagar et al, 2007; Sutskever et al., 2008; Salakhutdinov and 
Hinton, 2009). That is, in order to efficiently perform param- 
eter updates in these models, their learning rules either take 
into account the entire state of the model or it requires that 
information to propagate in a non-causal maner through the 
synapses. 

Therefore, the locality constraints imposed by biology con- 
stitutes one of the major challenges in transforming those 
algorithms into biologically plausible learning rules. 

Even in biology, however, certain types of non-local signals 
participate in the learning process, notably through neuromod- 
ulators which can convey information about the global state of 
the network or external information (e.g., reward or surprise) 
(Izhikevich, 2007; Schuhz, 2008; Fremaux et al, 2010). 

Here, we derive a principled learning rule for unsupervised 
learning in recurrent spiking networks that relies only on quanti- 
ties that are locally available at the synapse: pre-synaptic activity, 
post-synaptic activity and a global modulatory signal. 

The key innovation of our model compared to earlier stud- 
ies (Brea et al., 2011; Jimenez Rezende et al., 2011) lies in the 
computation of a global modulating signal which is a linear super- 
position of local terms and can therefore be interpreted as the 
diffusion of a neuromodulator in the extra-cellular medium. 

Furthermore our global neuromodulating signal conveys 
information about novelty or surprise on statistically rigorous 
grounds, providing interesting links with findings relating sur- 
prise and plasticity (Gu, 2002; Ranganath and Rainer, 2003; Yu 
and Dayan, 2005). 

We show with simulations based on synthetic data that our 
proposed learning mechanism is capable to capture complex hid- 
den causes behind the observed spiking patterns and is able to 
replicate, in its spontaneous activity, the statistics of the observed 
spike trains. 

Finally, we provide an application of our model to a hypothet- 
ical novelty detection task where a simulated agent (e.g., a rat) 
is inserted into a maze with specific properties (views, rooms, 
topology of the maze). Our model, simulating the "brain" of this 
agent, successfully captures the statistical properties of this envi- 
ronment. We show that, after learning, our model is capable of 
distinguishing the original environment from another environ- 
ment that differs only in its topology (relative location of the 
rooms). Additionally, we predict the "expected dynamics" of a 
neuromodulator signaling "novelty" as the agent traverses the 



virtual maze. We propose that this hypothetical experiment could 
be developed into a real experiment. 

2. MATERIALS AND METHODS 
2.1. NEURON MODEL 

The neuron model used in our simulations is a generalized linear 
model (GLM) in the form of a Spike Response Model (SRM) with 
escape noise (Gerstner and Kistler, 2002; Jolivet et al., 2006). The 
spike train of a neuron j for times t > 0 is denoted as Xj{t) = 

^f[€ jf' t'*' j ^ ~ where |f', . . . , fj^^j is the set of spike 

timings. 

We model the membrane potential of a neuron i as 



M,(t) = ^w,j(Pj{t) + rj,{t), 



(1) 



where r]j(t) = —r]Q dse ^"''^i" Xj(s) is the adaptation potential, 
Wij is the synaptic strength between neurons i and and </>j(f) 
is the potential evoked by an incoming spike from neuron j. 
The evoked potentials are modeled by a simple exponential filter 



0;(f) 



[dse r Xj(s) implemented as a differential equation 



j>jit) = -(Xj(t)-<Pj{t)), 



(2) 



where r is the time constant of the membrane potential. 

The spikes are generated by a conditioned Poisson process 
with exponential escape rate (Jolivet et al., 2006). That is, the 
conditional instantaneous firing intensity p,(f) is taken to be 



p,(f) = poexp 



M,(f) - !? 

Ah 



(3) 



where & and Am are physical constants of the neuron. However, 
we will keep p,(f) as an arbitrary function of u,(t), pi(t) = 
g{Ui{t)), in all our derivations and we specify the exponential 
form (Equation 3) only when performing the simulations (see 
further below). Equations (1-3) capture the simplified dynamics 
of a spiking neuron with stochastic spike firing. 

In the following simulations we assume that two different neu- 
rons i and j can have at most two common synapses, Wy and 
wj,. The neuron model and the potentials contributing to its 
activation are illustrated in Figure 1. 

In the following sections, we will first introduce the theo- 
retical framework in which we derive our learning rule. The 
learning mechanism is then derived in several steps, followed 
by simulations showing that our model and learning rules are 
capable of capturing complex spatio-temporal features in the 
input spike trains, reproduce them in its spontaneous activ- 
ity and perform statistical inference on the hidden causes of 
provided data. 

2.2. A PRINCIPLED FRAMEWORK FOR LEARNING 

In the following we consider a stochastic, fully connected net- 
work A4 composed of two sets of neurons which are functionally 
distinct. The first group which we call observed or visible neurons 
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(denoted by V) represents an ensemble of neurons which receive 
data in the form of spike trains. The spike trains of the visible 
neurons wiU be referred to as Jfy. These neurons are exposed to 
the external world. The second set, which we call hidden neurons 
(denoted by Ti) does not directly receive data from the external 
world. The spike trains of the hidden neurons will be referred 
to as Xn . Their role is to provide "compressed explanations" for 
the observed data. The topology of this network is illustrated 
in Figure 2C. In the absence of external drive, the spontaneous 
activity of the hidden neurons will contribute to the firing of 
the observed neurons. The spike trains of the entire network 



comprising both observed and hidden neurons will be indicated 
simply as X. 

Our network defined in this way constitutes a generative model 
of spike trains with hidden neurons. In the following we interpret 
synaptic potentiation and depression as a form of optimization 
of this generative model. More precisely, we assume that synaptic 
plasticity (the "learning rule") is trying to increase the likelihood 
of the observed spike trains under the model. 

In what follows, we review the calculation of the complete-data 
log-likelihood \ogp{Xy , Xu) for a recurrent network of point- 
process neurons. 




0 20 40 60 80 100 
Time(ms) 



FIGURE 1 I Neuron and synaptic models. Illustration of the different contributions to the total membrane potential of a neuron / in our model. 




neurons 



FIGURE 2 I The different network architectures discussed in this study. 

(A) Feed-forward, fully observed spiking network. It defines a conditional 
model of the target spiking patterns given the input patterns. (B) Fully 
connected and fully observed spiking network. It defines a generative model 
of the observed data. (C) Network decomposed in two pools of neurons: 



neurons which receive data or observed neurons V and neurons which do 
not receive data or hidden neurons H. Both pools of neurons are fully 
connected. (D) The network with Al-synapses (solid links) and Q-synapses 
(dashed links) which provides the infrastructure required for learning the 
generative model. 
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From Equations (1-3) it follows that the probability 
Pi (/■ €[t,t+ At]\X{0 . . . o) of producing a spike in the 
infinitesimal interval [f, f + Af] by the rth neuron conditioned 
on the past activity of the entire network X(0 . . . f ) and provided 
that p,(f) Af <SC 1 is given by 



P, e [f, f-l- Af]|X(0. . . f)^ ^ p,{t)At, 



(4) 



and the probability P,- (/^ ^ [t, At]\X{0 ...t)j of producing no 
spike in the same interval is given by 



P, (f{ ^ [f, f + At]\X{0 . . . o) «^ (1 - Afp,(f))- 



(5) 



Therefore, by discretizing a finite time interval [0, T] into N suffi- 
ciently small bins [tk, + At] with k = 1 ... N so that there is at 
most one spike in each bin and assuming independence between 
neurons within the infinitesimal bins we can write the probability 

P(X(0 . . . P)) of producing a spike train X,(f) = T,^ S (^t - /-^ 

for all neurons in the network as 



in detail in Paninski (2004) and Pfister et al. (2006). In what 
follows we review the calculation of the gradients of the data log- 
likelihood (Equation 8) with respect to the synaptic weights and 
then discuss a simulation revealing the limitations of the model. 

In a network without hidden neurons as illustrated in 
Figures 2A,B, the data log-likelihood (Equation 8) reduces to a 
simple form 

\ogp(Xv)=Tf dr[\ogpk(r)Xk(r)- pk(t)]. (9) 

In Paninski (2004) and Pillow et al. (2004) it is shown that the 
optimization problem defined by the log-likelihood (Equation 9) 
is convex. Therefore its global maximum can be found by gradient 
ascent. 

The gradient of Equation (9) with respect to the synaptic 
efficacies w,j is given by 



V^,,log|^(Xv) 



dHPk(t') 
dWii 



[Xk(t')-pk{t')i{m 



P(X(0 ...T))^ Yl n [P' (^) n [l - ^^P' firing rate fimction (EquiLn 3) 



where the gradients * are obtained by differentiating the 



ieVUH k". 



(6) 



d\0gpk(t') 



where ¥- and fc"* labels the bins with one spike and bins with- 
out spike from neuron i, respectively. Taking the limit N ^ oo 
of Equation (6) divided by the volume Af^, where K is the total 
number of spikes in the interval [0, T] we obtain the probability 
density 



(11) 



p{X(Q...T)) 



n 



exp 



(-r 



dtp,{t)\.{7) 



The log-likelihood corresponding to Equation (7) can be com- 
pactly written as 

logp(X(0 ...T))= ( dr [log p,(r)X,(T) - p,(r)] . (8) 

It should be stressed that the log-likelihood (Equation 8) is not a 
sum of independent terms, since the instantaneous firing rate of 
each neuron depends on the entire past activity of all the other 
neurons through Equations (1-3). 

In the following sections we derive a plasticity rule that will 
attempt to maximize the log-likelihood of the observed data 
logp(Xv). For this, we briefly review the calculation of the gra- 
dient of the observation likelihood in absence of hidden neurons 
and then we introduce our method for approximating its gradient 
when there are hidden neurons. 

2.3. FULLY OBSERVED NETWORK 

The gradient of the log-likelihood of fuUy observed networks 
of SRM neurons and similar point-processes have been studied 



We conclude that the gradient V^^^. logp(Xv) can be calculated 
in a purely local manner. More precisely, an update of the weights 
according to gradient ascent AWy oc logp(X\;) yields a learn- 
ing rule that is simply a "trace" of a product of two factors: A 
first factor cpj (f') that depends only on the presynaptic activ- 
ity and a second factor ^gl"^^)^ [Xk (f') - Pk (f')] t^^* depends 
on the state of the postsynaptic neuron. Moreover, the gradient 
(Equation 10) has been shown to yield a simplified form of STDP 
(Pfister et al, 2006). The notion of "trace" (i.e., a temporal aver- 
age of some quantity) will return for other learning rules later in 
this paper. 

In order to expose the weakness of the fully observed model 
described above, we test its performance on a task which consists 
of learning "stair patterns" involving 3 groups of 10 visible neu- 
rons, which are probabilistically activated using low (IHz) and 
high (700 Hz) firing rates. The activations generate a sequential 
pattern where each group remains active for a duration drawn 
from a Gaussian distribution with mean 30 ms and standard 
deviation 10 ms (truncated at positive values), (Figure 3A). This 
benchmark is interesting firstly because it requires the forma- 
tion of memories on the order of three times the membrane 
time constants of the single neuron dynamics and secondly 
because it requires the ability to learn the appropriated transi- 
tion probabilities. Our simulations show that a fully observed 
network is not capable of learning such patterns, as can be seen 
by looking at the samples produced from the learned network 
(Figure 3B). However, a model with 50 hidden neurons (with the 
learning rule discussed further below) can learn the distribution 
(Figures 3C,D). 



Frontiers in Computational Neuroscience 



www.frontiersin.org 



April 2014 I Volume 8 1 Article 38 | 4 



Jimenez Rezende and Gerstner 



Learning in Recurrent spiking networks 




0 200 400 

Time(ms) 



FIGURE 3 I Models with hidden neurons succeed where a fully observed 
model fails. Here we compare our model (online witli variance reduction, 
see IVIaterials and IVIethods) witli and without hidden neurons. (A) Sample 
from the training data. (B) Sample activity from the visible neurons of a 
learned model without hidden neurons. From the generated samples we can 
see that a model with only visible neurons can capture some correlations 
present in the data (note the formation of "downward" moving blurred 
patterns) but fails at capturing the long-term structure of the pattern. (C) 
Sample from a learned model with 50 hidden neurons: hidden neurons' 
spikes (top) and visible neurons' spikes (bottom). Note that the learned 



hidden representation unambiguously represents the global state of the 
visible neurons, generating different spiking patterns for each distinc phase of 
the training pattern. (D) Relative performance of the models with (vertical 
axis) and without (horizontal axis) hidden neurons measured by their data 
log-likelihood. The blue points indicate the relative evolution of the data 
log-likelihood of both models during a learning session. The model with latent 
neurons has systematically a better performance (higher log-likelihood) than 
the model without hidden neurons. (E) Model with hidden neurons in 
inference mode. The activity of the hidden neurons from the Q-network (top) 
forms a higher-level representation of the observed spike trains (bottom). 



Moreover, even if the fully observed network could learn the 
data distribution it would not provide any useful representation 
of the data, while a network with hidden neuron natu- 
rally forms higher-level representations of the incoming data 
(Figure 3E)(top). For precise details concerning the numerical 
simulations and evaluations, see further below. 

This simulation corroborates the intuition that a network 
consisting of visible neurons only is rather limited in scope. 
We therefore turn in the following to a more general network 
consisting of both visible and hidden neurons. 

2.4. PARTIALLY OBSERVED NETWORK 

In the following we first introduce our model that includes hid- 
den neurons. We derive our learning rule and introduce a few 
modifications to improve its performance. Finally, we show with 
simulations that the resulting model can learn spiking patterns 
that couldn't be learned by a model without hidden neurons. 

In a model that includes hidden neurons (that is, neurons not 
directly connected to the incoming data), the marginalized like- 
lihood of the visible neurons is obtained by integrating over all 
possible hidden spike trains X-h , 



p(Xv) 



/ 



(12) 



In the following we introduce the variational approxima- 
tion scheme for approximating Equation (12). The variational 
approach consists of approximating a complex distribution p by 
a simpler distribution q and provides a flexible generalization of 



the expectation-maximization (EM) algorithm (see Jaakkola and 
Jordan, 2000; Beal and Ghahramani, 2006). 

We are interested in approximating the posterior p(X-}i\X\;) 
by another distribution over spike trains q(Xfi |Xv). We optimize 
the parameters of the distribution q{Xfi\Xv) by minimizing the 
KL-divergence 



KL{q;p) = j (X„|Xv) log 



q(Xn\Xv) 
P(Xh I ^v) 



/ 



VXnq(Xn\Xv) log '^^^'^^^^\ + logpCXy) 

p(A-H, Ay) 



= {logqiXnlXv) - logp(X„,Xv)>,(^^|^^) 
-I- logp(Xv) , 



(13) 



Data loe-likelihood 



where (f (X)) p = f 2?X/(x)p(x). The first term inEquation (13) is 
known in statistical physics as the Helmholtz free energy (Landau 
et al, 1980), 



T = {\ozq(Xn\Xv) - logp(X„,Xv)[ 



(14) 



The second term in Equation (13) is simply the data log- 
likelihood. Since the KL-divergence KL{q; p) between two distri- 
butions q and p is non-negative (Gibbs and Su, 2002), the free 
energy (Equation 14) is an upper bound on the negative log- 
likelihood. Therefore, we can redefine the problem of maximizing 
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the data log-likelihood logp(Xv) with respect to the parameters 
of the generative model p as the double optimization problem 
of minimizing the free energy J- with respect to the param- 
eters of q and with respect to the parameters of the original 
model p. 

The attractiveness of such an approach for deriving biolog- 
ically plausible rules comes from the fact that the distribu- 
tion q is arbitrary (as long as it has the same support as the 
true distribution). Therefore we can either choose it in order 
to simplify the calculations or to improve the model's com- 
pliance with known biological constraints, such as causality 
and locality of the weight updates. These interesting proper- 
ties of the variational approjdmation have been explored by a 
diversity of models (Dayan, 2000; Friston and Stephan, 2007; 
Jimenez Rezende et al, 2011; Brea et al., 2013; Nessler et al., 
2013). 

In what follows, we postulate that the posterior distribution 
of the hidden spike trains of the network can be well approxi- 
mated by another recurrent network of spiking neurons which we 
call the network Q. This network is composed of the same neu- 
rons as the original model, except that it has different synaptic 
connections. Its connectivity is depicted in Figure 2D. 

The assumption of an "inference" network Q is analogous 
to the recognition model introduced in Dayan (2000) for the 
Helmholtz machine. 

Our model differs from the Helmholtz machine, as introduced 
in Dayan (2000), in two key aspects: ( 1 ) The Helmohtz machine is 
a model for stationary data, i.e., it cannot readily model temporal 
sequences; (2) Although the recognition network is introduced 
in a variational framework, the proposed learning rule is not 
attempting to minimize a free energy (there are different cost 
functions for the generative and recognition models) whereas our 
learning rule is explicitly attempting to minimize the free energy 
associated to the model. 

In other words, our model consists of a single set of neurons 
with two sets of synaptic weights that can be turned on and off 
independently (possibly through the action of specific neuromod- 
ulators). The first set of weights, which we will refer to as w^, 
parameterizes the original generative model . The second set of 
weights, which we will refer to as parameterizes the network 
Q. The topology of the network Q is restricted and excludes con- 
nections toward the visible neurons from hidden and other visible 
neurons. 

In the following, all the quantities (e.g., membrane potentials 
and firing rates) that are computed using the parameters (i.e., 
with Q turned off) will have the superscript M . Analogously, all 
the quantities that are computed using the parameters (i.e., 
with turned off) have the superscript Q. 

When driven solely by the observed spike trains and by 



the weights w'^ , the activity of the hidden neurons Xu is, by 



construction, an approximated sample from true the posterior 
distribution p(X-}i\Xy) provided by the distribution q{Xi-i\Xy). 
Therefore, if we want our model to perform approximated 
Bayesian inference on the most likely hidden causes of an 
observed spike train Xy we just have to run it with the synapses 
v/f turned off. Conversely if we want to produce a sample from 



the learned generative model, we have to run it with the synapses 
turned off 

y 

In the next sections we derive stochastic estimators of the gra- 
dients of the free energy T with respect to the synaptic weights 
and in a biologically plausible manner. 

We show that the naive gradients obtained for w'^ are prob- 
lematic since their variance grows quadratically with the size of 
the network. We then introduce a simple modification to reduce 
the variance of the obtained gradients based on techniques from 
reinforcement learning. Finally we modify the gradient estimators 
so that they turn into "on-line" parameter updates. 

2.5. STOCHASTIC GRADIENTS 

The complete data log-likelihood £^ of the model Ai and 
of the model Q are given by 

C-^ =\oip{Xv,Xn) 

^ dx[\ogp^(r)X,{x)- p^ix)] (15) 



and 



= \ogq{Xu\Xv) 



= 1] f '^'^ [logP,^(r)X,(r) - pP(r)] (16) 

respectively. 

The free energy (Equation 14) corresponding to the log- 
likelihoods (Equations 15, 16) is given by 



(17) 



In the following we simplify the notation and write (•)q instead 
of (•)i3(X-h|Xv)- wish to write the learning equations for both 



and as simple gradient descent on the free energy: 



V 



(18) 
(19) 



where and fi^ are learning rates for the networks A4 
and Q, respectively. The exact gradients V^a4 J- and V^s JF are 

difficult to evaluate analytically since we cannot compute the 
required expectations. Therefore we resort to unbiased stochastic 
approximations of those gradients. 

The calculation of the gradient V^ai J- is analogous to the fully 

observed case and is given by 



M\ 



(20) 
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where L-^ is a point-estimate of the complete data log-likehhood 
of our generative model obtained by computing the required 
traces from a single simulation of the network Q in the inter- 
val [0, T]. Similarly, the stochastic gradient with respect to 
is given by 



(21) 



where we have used the fact that ( V^^q \ = V^q (exp 



V Q 1 = 0 and J- is the point-estimate of the free energy, 



= f drY, [logpp(r)X,(T) - pP(t)] 

- f dx [logp,^(r)X,(r)-pr^(r)] 

= f dxTr, 
Jo 

where we have defined the "instantaneous free energy" J--c as 
J'r=J2 [HP^mir) - Pp(r)] 



(22) 



- J2 Np,^(r)X,(r)-p,:^(r)]. (23) 

Note that during learning the activity of the visible neurons is 
driven purely by the observed spike trains while the activity of 
the hidden neurons and related quantities (e.g., p® and p^) 
is driven by the Q-network. Expanding the remaining gradients 
in Equations (20, 21) using the chain rule we obtain the "batch 
mode" learning equations 



L 



[^,(O-P,^(f)]0;(f) 

yiJeVuH, (24) 



nT g' (mP(0) 
wf (T) ^ -11^ T / dt ) ( X,(t) - p2 (t) cPj 



(f) 
(25) 



Note that the learning rule (Equation 24) is the same as in the 
fully observed network case in Equation (10). The learning rule 
(Equation 25) for the network Q is similar, but contains an addi- 
tional modulation factor, the point estimate of the free energy J-, 
which appears as a global signal that modulates the learning of 
the network Q. Since J- provides a lower bound on the data log- 
likelihood, the free energy measures how much the recent history 
of observed spike trains "fits" the generative model defined by the 
network A4 . The assumption of a globally available signal con- 
veying information about reward or surprise is standard in the 
reinforcement learning literature. 

The naive stochastic gradients (Equations 24, 25) are not effi- 
cient in practice. Even though they constitute unbiased estimators 
of the true gradients, their variance is prohibitively high. We 
address this problem below. 

2.6. REDUCING THE VARIANCE OF THE GRADIENTS 

Stochastic gradients of the form Equation (25) have been exten- 
sively studied in the reinforcement learning literature and several 
approaches with different levels of complexity have been pro- 
posed for reducing their variance (Munos, 2005; Bhatnagar et al., 
2007). In the following, we sketch some theoretical arguments for 
why gradients of the form of Equation (25) should be expected to 
scale badly with the size of the network. 

Let N be the number of neurons in the network and T be 
the time window on which we compute the relevant quantities. 
Equation (25) contains the free energy J- which is, according to 
Equation (22), a sum of traces (integrals on the interval [0, T]) 
for each neuron. Therefore, in a weak-coupling scenario, the esti- 
mator (Equation 25) is the integral on the interval [0, of a 
sum of N weakly correlated terms in the free energy, that we 
assume to have some typical variance cTq and mean m. Under 
these assumptions, the variance of the gradient (Equation 25) 
scales approximately as 



That is, as the size of the network grows, the variance of the 
gradient (Equation 25) grows with the square of the number of 
neurons. A naive solution would be to decrease the learning rate 
as /i ^ oc 1 /N but this would make learning too slow for larger 
networks. 

In the following, we adopt a simple baseline removal approach 
to reduce the variance of our gradient estimator. That is, we sim- 
ply subtract the mean T of the free energy T, calculated as a 
moving average across several previous batches of length T, from 
the current value T(T). This yields the learning rule 



w^(T)^-n'^~'e{T) dt 



[X,(f)-P,^(f)]0;(t) 

(26) 



where we have introduced the "free energy error signal" e{T) 
T(T) - T. 
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With this simple change, we can see that the variance of our 
gradient scales as 



Var ^wfj oc [^i^f {N x T x a^) 



The baseline removal trick is not a solution to the problem but it 
drastically reduces the variance of our estimator when the size of 
the network is increased. 

Note that the quadratic or linear growth of the gradient's vari- 
ance with the number of neurons is not just an artifact of our 
variational approximation. The arguments presented in this sec- 
tion apply to any learning rule that has the form of local Hebbian 
traces modulated by global signals that are composed of several 
weakly correlated terms. For instance, the learning rule proposed 
in Brea et al. (20 11) also falls into this category. 

2.7. ONLINE vs. BATCH LEARNING 

The gradients that define our learning scheme (Equations 24, 26) 
are given in terms of quantities accumulated over a given time 
interval [0, T]. In this sense, we have derived a "batch" learn- 
ing rule and we would have to wait until the end of the interval 
[0, T] in order to apply the changes in the parameters of our 
model. 

However, compatibility with biology requires that we have 
an online version of our algorithm. This can be approximately 
achieved if, instead of accumulating the traces required for cal- 
culating the gradients on time interval [0, T] and applying the 
parameters updates in the end, we replace the traces by mov- 
ing averages and apply the parameter updates at every time 
step. The modified learning rules can be formulated as fol- 
lows. At each synapse, we have "Hebbian traces" H'^'^{t) 
that keep track of pre- and post-synaptic activity and evolve 
according to 



tGH^f (f) = -Hf (0 + 



rciifit) = -Hf{t) + 



(27) 



[m)-p^(t)\cp,{t), 



(28) 



where we have introduced a time constant tg controlling the 
time-scale of the moving averages. Similarly, the online estimate 
of the error signal e]si( T) is obtained by replacing time integrals in 
the interval [0, T] with moving averages 



^ baseline*^ 



(29) 
(30) 



That is, is a "short term" moving average of the instantaneous 
free energy (Equation 23) with time-scale tg while is a "long 
term" moving average of Tt with a longer time-scale depend- 
ing on xq and Tbaseline- Note that the "error signal" e^iT) can 



also be interpreted as an instantaneous surprise measure rela- 
tive to the slow "background" surprise level J-. The updates of 
the weights uses the Hebbian traces and fixed learning rates /x^ 
and /U.^ 



(f) = M^Hf (f) 



-li'-^eNitW^fit). 



(31) 
(32) 



Thus the update of the A4 -network is given by a "Hebbian" rule 
whereas the update of the Q-network follows a three-factor rule 
with the surprise as a global factor. This architecture is illustrated 
in Figure 4A and the three-factor rule for the Q-synapses is illus- 
trated in Figure 4C. Another way of interpreting the learning rule 
(Equation 32) is that it is simply proportional to the covariance 
between the Hebbian trace and the moving average of the 

free energy T. If they are uncorrelated, the expected change in the 
parameters will be zero and the synaptic weights wiU just perform 
a centered random walk. 

2.8. A SIMPLIFIED MODEL 

Since the variational distribution q in Equation (13) can be arbi- 
trary, one could imagine a model simpler than the one derived in 
the previous sections which consists in approximating the poste- 
rior distribution of the hidden neurons directly by the forward 
dynamics of the generative model. In practice this approxima- 
tion amounts to constraining the synaptic weights of the 

Q-network to be equal to the synaptic weights of the 
-network for i G H and; e V U 

Under this constraint, the learning equations (31) and (32) 
reduces to 



^(0 



,M 



(t) 

tM 



if i e V 



eN(t)H^(t) otherwise 



(33) 



and the instantaneous free energy simplifies to a sum over the 
observed neurons only 



(34) 



!€ V 



The architecture of this model is illustrated in Figure 4B The 
idea of using the forward-dynamics as a proposal distribution for 
the posterior has been used in Brea et al. (2011) and Brea et al. 
(2013), where the proposals are then weighted by an importance- 
sampling scheme to better represent the true posterior distribu- 
tion over the hidden activity. 

Further below (see results) we show that, at least in the con- 
text of the variational learning discussed here, this approximation 
does not outperform our more general model. 

2.9. NUMERICAL SIMULATIONS 

All the simulations in this study are based on a discrete- 
time version of the Equations (31, 32). The spiking process 
is approximated by taking 1 — exp[— dfp(f)] as the probabil- 
ity of producing one spike in the finite time bin [t,t + dt]. 
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Ht)-:F{t) 



FIGURE 4 I The different studied architectures and three-factor STDR 

(A) Illustration of our main model (Equations 31, 32) with its two sets of 
synapses Q (black) and A4 (gray). The "error signal" er^(t) modulates the 
learning of the Q-synapses. (B) Illustration of the simplified model (Equations 
33, 34) with its single set of synapses A4. The "error signal" 6^(1) modulates 
the learning of the -synapses. (C) Learning in the main model's 
Q-synapses is triggered by the covariance between the Hebbian trace H,p 



and the moving average of free energy J^. The inset blue curve indicates the 
shape of the Hebbian trace term as a function of the time interval between 
pre and post synaptic spikes. The dotted gray line corresponds to H? = 0 (no 
weight change). A positive covariance will induce synaptic depression (left 
quadrant of blue curve) while a negative covariance will induce synaptic 
potentiation (right quadrant of blue curve). A null covariance would induce a 
centered random walk on the synaptic weights. 



The values of the parameters used in this study are reported in 
Table 1. 

The initial synaptic weight of both Ai and Q were sam- 
pled from a Gaussian distribution with mean zero and standard 
deviation of 0.01. 

For all experiments, the training data consists of binary arrays 
with ones indicating spikes and zeros indicating no-spike in the 
corresponding time bin. The training data-sets are organized in 
batches of 200 ms which are sequentially presented to the model. 
During our training sessions, each learning epoch corresponds to 
500 presentations of data batches. In other words, each epoch cor- 
responds to a total of 100 s of spiking data. During the learning 
phase, the visible neurons are exactly driven by the data spike 
trains, that is, they are forced to spike or to not spike in the 
exact same way as the data samples at each time bin. During the 
spontaneous activity phase the networks are running without any 
external drive. 

The log-likelihood of test data was estimated by an impor- 
tance sampling procedure. Given a generative model with density 
p(Xv, Xh) over observed variables Xy and hidden variables x/,, 
importance sampling allow us to estimate the density p(Xv) of a 
data point Xy as 

p(Xy) = {pixv\xh})p^ . = {p(Xy\xh)w(xh, ^v)) ( ) , (35) 



Table 1 | Parameters used in the simulations. 



Parameter 


Description 


Value 


dt 


Time discretization interval 


1 ms 


T 


Membrane potential time constant 


10 ms 


10 


Adaptation potential strength 


0.1 mV 


^adapt 


Adaptation potential time-scale 


10 ms 


PO 


Firing rate scale 


1 kHz 


!? 


Firing threshold 


OmV 


AU 


Firing sensitivity to the membrane potential 


1 mV 




Learning rate for the model network JVi 


0.00001 




Learning rate for the recognition network Q 


0.00001 




Time scale for the moving averages of the 
gradients and free energy 


10 ms 


^baseline 


Time scale for the moving average of mean 
free energy 


100 ms 



where q{xi,\Xy) is an arbitrary distribution with same sup- 
port as p(xh) and w(xh,Xy) = p{xh)/qixh\Xy) is the impor- 
tance weight. The Equation (35) can be rewritten in terms 
of the point-estimated of the free energy (Equation 22) as 
follows 



Frontiers in Computational Neuroscience 



www.frontlersin.org 



April 2014 I Volume 8 | Article 38 | 9 



Jimenez Rezende and Gerstner 



Learning in Recurrent spiking networks 



p{Xy) = {p(Xy\Xh)w(Xh, Xy))^^^^^^^^^ 

= {exp[\ogpiXy\xh) + \ogpixh) - logq(xh\Xy)])^^^^^^^ 
= {exp-:F(Xy,Xh)) , (36) 

\ lq(xi,\Xy) 

where J-(Xy, xh) is a point-estimator of the free energy. 

Using Equation (36) we estimate the log-likelihood of a 
observed spike-train by sampling several times from the network 
Q and taking the average of the exponentiated point-estimator of 
the free energy (Equation 22). For our applications, we have gen- 
erated 500 samples of duration 100 ms from the network Q per 
estimation. 

3. RESULTS 

The learning rules derived above (see Materials and Methods) 
come in four different variants. First, a batch-based naive gradi- 
ent descent rule. Second, a variant that reduces the variance of 



the gradient estimator. Third an online version of the variance- 
reduced rule that is biologically more plausible than a batch rule. 
Finally, a simplified version of the model where the Q-network is 
merged with the A4 -network in a specific manner. 

3.1. VARIANCE REDUCED RULE vs. NAIVE RULE 

Naive gradient descent on the free energy yields the batch learn- 
ing rule (Equation 25). Tested on the stairs-patterns (Figure 3A) 
using a network of 30 visible neurons and 30 hidden neurons gen- 
erates very slow learning. In Figure 5A we show that the learning 
rule with the variance reduction (Equation 26) performs substan- 
tially better than the naive gradient (Equation 25). Both networks 
have the same number of visible and hidden neurons, but the 
first network is trained with the gradient given in Equation (25) 
whereas the second network is trained with the gradient defined 
in Equation (26). The data log-likelihood of both networks is 
approximated by importance sampling after every learning epoch. 

In Figure 5A, along the horizontal axis we plot, across sev- 
eral epochs the log-likelihood of the model using the learning 
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FIGURE 5 I Variance reduction and Q-network are important while 
on-line approximation is not. Comparison of the different flavors of the 
proposed model across 100,000 learning epochs on the "stairs pattern" task. 
Shown log-likelihoods where estimated by importance sampling every 500 
epochs during learning. Likelihood values (crosses) further to the right (high 
log-likelihood) correspond to the end of learning while the cross on the 
diagonal marks the beginning of learning (epoch 1). (A) Model with variance 



reduction (horizontal axis) achieves a higher log-likelihood than a naive model 
without variance reduction (vertical axis). (B) Batch model with variance 
reduction(horizontal axis) and on-line model with variance reduction (vertical 
axis) exhibit a similar evolution of log-likelihoods. (C) Online-model with 
variance reduction (horizontal axis) performs better than the simplified model 
(without the Q-network) with variance reduction defined by Equations (33, 
34) (vertical axis). 
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rule with variance reduction and along the vertical axis the log- 
likelihood of the model without variance reduction (i.e., the naive 
batch gradient rule). We find that the log-likelihood of the vari- 
ance reduced model is consistently above the log-likelihood of 
the naive model, indicating an increase in learning speed of more 
than a factor 100. 

3.2. ONLINE ALGORITHM vs. VARIANCE REDUCED BATCH ALGORITHM 

Online algorithms are biologically more plausible than batch 
algorithms which require storage of intermediate results. Here we 
show that the online learning rules (Equations 31, 32) induce only 
minor impairments of the performance compared to the batch 
rules (Equations 24, 26). 

Both versions of the model were trained on the same "stair 
pattern" task and the results are shown in Figure 5B. Along the 
horizontal axis we indicate the log-likelihood of the batch model 
and along the vertical axis that of the online model, across several 
epochs of learning. Both performances are strongly correlated, 
indicating that the online approximation does not introduce any 
major impairment in the model for this particular dataset. 

3.3. FORWARD DYNAMICS vs. THE Q-NETWORK 

Here we show that the simplified model defined by the Equations 
(33, 34) does not reach the performance of the more general 
model defined by Equations (31, 32). 

Both versions of the model were trained on the same "stair 
pattern" task and the results are shown in Figure 5C. Along the 
horizontal axis we indicate the log-likelihood of our model with 
Q-network and along the vertical axis that of the simplified 
model, across several epochs of learning. Both performances are 
correlated. However, the simplified model has clearly lower log- 
likelihoods than the complete model. This result suggests that the 
forward dynamics of the generative model may provide a poor 
approximation to the true posterior distribution of the hidden 
neurons compared to having an independently parameterized 
inference network. 

3.4. HIDDEN REPRESENTATIONS AND INFERENCE 

To show that our model is not only learning a prior that represents 
the data but can also form interesting hidden representations and 
perform inference on the hidden explanations of the incoming 
data, we use the Q-network of a model with 50 hidden neurons, 
trained on the stairs dataset Figure 3A. The activity of the hid- 
den neurons of the model during sampling (or "dreaming") are 
shown on Figure 3C (top). As we can see just by visually inspect- 
ing the activity of the hidden neurons, they form an unambiguous 
representation of the activity of the visible neurons in this simple 
example. Conversely, by running the model on "inference mode" 
(i.e., with the synapses activated) can also see that the model 
is capable of performing inference on the causes of the incoming 
data Figure 3E. 

3.5. THE ROLE OF THE NOVELTY SIGNAL 

The resulting online rule given in Equation (32) for is a Hebbian- 
type plasticity rule modulated by a global novelty signal e^it) 
where eAr(f) is a measure of surprise relative to a slow moving 
average of the free energy. We repeat the learning rule for the 



Q-network from Equation (32), 

wf(t) = -ii^eN(t)Hf{t), (37) 

where (f) is a synaptic trace that keeps track of "Hebbian" 
coincidence between pre- and post-synaptic activity. 
The Hebbian trace 

g' (u^it)) 

xcHfit) = -Hf{t) + ; ' ( \x,{t) - p^(t) (Pj{t)08) 

is activated by the product of a voltage-dependent post-synaptic 

.?'("p(f)) r o 1 

factor ) Q — \X,(t) — p (t) and a presynaptic EPSP caused 
(t)j L ' J 

by presynaptic spike arrival. 

It is known that the brain is able to detect novelty and to broad- 
cast novelty related signal across large brain regions (Gu, 2002; 
Ranganath and Rainer, 2003). In order to illustrate the dynamics 
of the novelty signal in our model we describe a task which could 
be transformed into a real animal experiment. 

We place an agent (e.g., a rat) in a maze and let it explore it. 
During the exploration, the agent will learn the topology of the 
maze through a combination of visual and proprioceptive infor- 
mation. If the agent is suddenly transported to another maze with 
many similar but a few different parts we expect this change to 
trigger some novelty signal in the brain of the agent whenever it 
encounters the parts of the new maze that differ from the learned 
maze. In order for the agent's brain to detect a change in the envi- 
ronment, it must first learn a sufficiently accurate model of the 
environment. We hypothesized that our recurrent neuronal net- 
work learns the structure of the environment and at the same time 
provides an online surprise signal when it encounters a "novel" 
situation. Such a novelty signal could be compared to recordings 
of different neuromodulators. 

We created two virtual mazes composed of 16 "rooms" 
arranged as a square lattice where only neighboring rooms are 
accessible from each other: a first one which the agent will learn, 
the target maze, indicated in Figure 6A (left) and the test maze 
Figure 6A (right) which is similar to the first but with a few, 
randomly chosen rooms, replaced as indicated in Figure 6A by 
the yellow squares. Note that the only way to detect the differ- 
ence between the two mazes in this simulation is to actually learn 
the exact connectivity graph of the rooms, because the rooms are 
themselves identical in both mazes. 

The views corresponding to each room were generated by ran- 
domly choosing images of handwritten digits from the MNIST 
dataset (LeCun and Cortes, 2010). The MNIST images are 28 x 
28 gray-scale images of handwritten digits. We converted the 
pixel-values to firing rates in the range [0.01 Hz, 9 Hz]. To keep 
the simulations simple, time was considered in abstract units for 
these simulations (with time steps of 100 ms instead of 1 ms). 

For the "brain" of the agent we used a recurrent binary net- 
work with 30 hidden neurons and 28 x 28 visible neurons. In 
order to train this network, data batches where produced by 
recoding the activity of the visible neurons while the agent per- 
forms random trajectories of 100 time-steps in the target maze. 
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FIGURE 6 I Novelty-gated Hebbian learning. (A) Tine target maze (left) 
and test maze (riglit). Eacli one of the 16 rooms in the maze is 
represented by a random MNIST digit. Transitions are only possible 
between neighbor rooms. The test maze was generated from the target 
one by randomly replacing a few rooms (indicated with yellow borders). A 
sample trajectory is shown with the green arrows. (B) Evolution of the 
slow moving average of the free energy JF as a function of the amount of 



observed data for the target maze (blue) and the mean free energy of the 
same model when "teleported" to the control maze (red) every 500 s. (C) 
Fine-grained temporal details of the fast free energy traces ew(0 while 
traversing the sample trajectory for the target maze (blue) and for the test 
maze (red). Visited rooms are indicated by the annotated coordinates at 
each change-point. During the intervals between two consecutive 
change-points the agent remains in the same room. 



Each learning epoch corresponds to 500 presentations of these 
data-batches to our model. 

In Figure 6B we plot the "slow" moving average of the free 
energy J- as a. function of the learning time during the explo- 
ration phase for the target maze (blue curve) and for the test 
maze (red dots). The reported free energy for the control maze 
was measured every 5000 s for a path of length 50 s. As we can 
see, in the beginning of the learning the model is unable to dis- 
tinguish between both mazes (both have high free energy). But 
as the model learns the target maze, it successfully identifies the 
test maze as "unfamiliar" (attributing low free energy to the target 
maze and much higher free energy to the test maze). 

In Figure 6C we plot for both mazes the free energy error sig- 
nal eN(t) for the sample trajectory shown in Figure 6A. From 
Figure 6C (blue) we can see that ejv(f) fluctuates around zero for 
the learned maze but deviates largely from zero for the test maze. 

This result suggests that if animals have a neurophysiologi- 
cal correlate of the "free energy error signal" e{T) introduced in 
Equation (26) we should look for activity bursts when the ani- 
mals traverses unexpected situations (e.g., when traversing the 
test maze from position (2, 1) to position (2, 2)). 

Moreover we should expect a substantial increase in the vari- 
ance of the changes in synaptic weights when moving from a 



learned maze to an unfamiliar maze due to the change in the 
baseline of the surprise levels. 

4. DISCUSSION 

We have proposed an alternative to the learning algorithms pre- 
viously proposed in Brea et al. (20 11) and Jimenez Rezende et al. 
(20 11) for learning a generative model of spike trains defined by 
recurrent spiking networks. Our new model combines techniques 
from variational learning and reinforcement learning to derive a 
new efficient synaptic plasticity rule. 

The resulting (see Materials and Methods) online rule for 
synapses is a Hebbian-type trace modulated by a global novelty 
signal. The Hebbian terms are traces of products of pre-synaptic 
terms (EPSPs) and post-synaptic terms. Similar gradients have 
been studied in Pfister et al. (2006) where they are found to 
yield STDP-like dynamics. Importantly, the global modulating 
signal. Equation (23) is a linear superposition of terms locally com- 
puted by each neuron so we can interpret it as the diffusion of a 
neuromodulator in the extra-cellular medium. 

The original feature of the proposed model is that it uses an 
auxiliary recurrent spiking network in order to approximate the 
posterior distribution of the hidden spiking activity given the 
observed spike trains in an on-line manner. Using this auxiliary 
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network allowed us to derive a learning rule which is based on 
local gradients modulated by slow non-local factors conveying 
information about "novelty." 

We have shown that naive stochastic gradients derived in such 
a framework are not viable in practice due to their high vari- 
ance (which may grow quadratically with the number of hidden 
neurons). Deriving viable learning rules thus requires finding 
low- variance and unbiased estimators of the gradients defined in 
Equations (24, 25). In this paper we have only reduced this prob- 
lem as our learning rule (Equation 26) still has a variance that 
grows (linearly) with the number of hidden neurons. 

Our proposed learning algorithm, has potential applications 
for finding functional networks from recorded neurophysiolog- 
ical data. Since it can learn a recurrent spiking network that 
approximatively "explains" the data. Taking into account external 
currents injected into the network would be straightforward. 

We also provide an on-line neural estimator of nov- 
elty/surprise and make experimentally testable predictions about 
its dynamics. The estimator is a quantity that naturally emerges 
from the statistical principles behind our framework instead of 
being an ad hoc quantity. 

From the biological literature, it is known that novelty or sur- 
prise is, at least partly, encoded in neuromodulators such as ACh 
(Ranganath and Rainer, 2003; Yu and Dayan, 2005). Moreover, 
neuromodulators are known to affect synaptic plasticity (Gu, 
2002). We suggested a hypothetical animal experiment that could 
test predictions concerning a novelty signal and its potential 
relation to plasticity. 

5. RELATED WORK 

The framework in which we derive our model is very general and 
has been used in different ways in previous works (Dayan, 2000; 
Friston and Stephan, 2007; Jimenez Rezende et al, 2011; Brea 
et al, 2013; Nessler et al., 2013). The uniqueness of the present 
works relies on the combination of methods from variational 
learning and reinforcement learning yielding a learning rule that 
is both biologically plausible and efficient. 

The main difference between our model and models that 
exploit more analytical properties of the variational approxima- 
tion (explicitly computing expectations and covariances terms in 
the free energy) (Friston and Stephan, 2007; Jimenez Rezende 
et al., 2011) is that, although these analytical approximations 
may provide gradient estimators with much lower variance and 
typically yield more scalable algorithms, these methods are intrin- 
sically non-local and further approximations are required in order 
to obtain a biologically plausible learning rule. 

An interesting learning algorithm for recurrent spiking net- 
works which approximates the gradients of the KL-divergence 
(Equation 13) using an importance sampling technique has been 
proposed in Brea et al. (2011) and Brea et al. (2013). Their algo- 
rithm does not use an auxiliary network to approximate the 
activity of the hidden dynamics conditioned on the observed 
spike trains. Instead, the generative forward dynamics of the 
model is used as a proposal distribution which is then weighted 
by an importance-sampling approximation. 

Another important family of algorithms proposed in 
Habenschuss et al. (2012) and Nessler et al. (2013), heavily 



relies on approximating recurrent neural networks with soft 
winer-takes-all (WTA) dynamics. One advantage of assuming 
a WTA dynamics is that the computation of the gradients of 
the KL-divergence (Equation 13) greatly simplifies, yielding a 
simple local learning rule. A limitation of their approach is that 
the model does not take into account the fuU temporal dynamics 
during inference. 
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